Developer Release Note - Text Encoding Converter Manager 1.4
(August 2, 1998; updated September 22, 1998 - P. Edberg)


Version 1.4 of the Text Encoding Converter Manager (TEC) is included with Mac OS 8.5. This note describes changes from TEC 1.3, including the single bug fix in TEC 1.3.1.


1. Interface file changes

These changes are in Universal Interfaces 3.2, and will be in the interfaces included with CodeWarrior Pro 4.

a) Added constant kTextEncodingUnicodeV2_1 (TextCommon.h) to indicate Unicode version 2.1. TEC 1.4 and later treat Unicode 2.0 as if it were Unicode 2.1, so the constant kTextEncodingUnicodeV2_1 has the same numeric value (0x0103) as the constant kTextEncodingUnicodeV2_0. TEC versions earlier than 1.4 do not support Unicode 2.1.

b) Added constant kTextEncodingMacUnicode (TextCommon.h), numeric value 0x007E. This is a meta-value, like kTextEncodingUnicodeDefault, and TEC handles it similarly: It resolves kTextEncodingMacUnicode to an actual Unicode version, currently kTextEncodingUnicodeV2_1.

Beginning in Mac OS 8.5, the set of Mac OS script codes has been extended for some OS components to include Unicode. Some of these components have only 7 bits available for script code, so the constant kTextEncodingUnicodeDefault (0x0100) could not be used to indicate Unicode. Instead, kTextEncodingMacUnicode is used to indicate Unicode handled as a special Mac OS script code.

For example, kTextEncodingMacUnicode can be used to indicate Unicode in the 7-bit script code field of a Unicode input method's ComponentDescription.componentFlags field; it can also be used to indicate Unicode in the 16-bit script code field of an AppleEvent's typeIntlWritingCode text tag.
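
The size constraint described above can be illustrated with a minimal C sketch. The two constants are redeclared locally with the values quoted in this note so that the fragment builds without TextCommon.h; the helper FitsInSevenBitScriptField is hypothetical and is not part of any Apple interface.

```c
#include <stdbool.h>

/* Values quoted from this note; redeclared locally so the sketch
   builds without TextCommon.h. */
enum {
    kTextEncodingUnicodeDefault = 0x0100,
    kTextEncodingMacUnicode     = 0x007E
};

/* Hypothetical helper: true if the value can be stored in a
   7-bit script code field (that is, it is no larger than 0x7F). */
static bool FitsInSevenBitScriptField(unsigned long value)
{
    return value <= 0x7FUL;
}
```

With these definitions, FitsInSevenBitScriptField(kTextEncodingMacUnicode) is true and FitsInSevenBitScriptField(kTextEncodingUnicodeDefault) is false, which is why a separate constant was needed.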

c) Added constants for TextEncodingVariant values that apply to kTextEncodingMacRoman (TextCommon.h). These are a consequence of the fact that the standard Mac OS Roman encoding has changed with Mac OS 8.5: The code point 0xDB, which was used for CURRENCY SIGN in earlier versions of Mac OS Roman, is now used for EURO SIGN. The relevant TEC changes are described in more detail in section 3c.

d) Added constants for more TextEncodingBase values (TextCommon.h). The corresponding encodings are not supported in TEC 1.4, but will be supported in a future TEC version:

kTextEncodingMacCeltic = 0x27 // Modified MacRoman (supports Welsh)
kTextEncodingMacGaelic = 0x28 // Modified MacRoman (Irish with dots)
kTextEncodingMacInuit = 0xEC // For Nunavut province of Canada
kTextEncodingISOLatin3 = 0x0203 // ISO 8859-3
kTextEncodingISOLatin4 = 0x0204 // ISO 8859-4
kTextEncodingWindowsVietnamese = 0x0508 // Windows code page 1258

e) Added constants for new conversion control flags for the iControlFlags parameter of ConvertFromTextToUnicode, ConvertFromUnicodeToText, etc. (UnicodeConverter.h):

kUnicodeForceASCIIRangeBit = 9
kUnicodeNoHalfwidthCharsBit = 10
kUnicodeForceASCIIRangeMask = 1L << kUnicodeForceASCIIRangeBit
kUnicodeNoHalfwidthCharsMask = 1L << kUnicodeNoHalfwidthCharsBit

Deprecated the constants for the never-supported TextEncodingVariant options that were previously intended to be used for implementing this capability (TextCommon.h): kJapaneseNoOneByteKanaOption, kJapaneseUseAsciiBackslashOption.

See section 3a below.
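
As a rough illustration of how the new bits combine into an iControlFlags value, here is a small C sketch. The constants are redeclared locally with the values listed above so that the fragment compiles without UnicodeConverter.h; BuildControlFlags is a hypothetical helper, and the actual call to ConvertFromTextToUnicode or ConvertFromUnicodeToText is deliberately omitted.

```c
/* Bit numbers and masks as listed in this note (UnicodeConverter.h);
   redeclared locally so the sketch compiles on its own. */
enum {
    kUnicodeForceASCIIRangeBit  = 9,
    kUnicodeNoHalfwidthCharsBit = 10
};
enum {
    kUnicodeForceASCIIRangeMask  = 1L << kUnicodeForceASCIIRangeBit,
    kUnicodeNoHalfwidthCharsMask = 1L << kUnicodeNoHalfwidthCharsBit
};

/* Hypothetical helper: builds a flags value suitable for passing as
   iControlFlags. Other control flags would be OR'ed in the same way. */
static unsigned long BuildControlFlags(int forceASCIIRange, int noHalfwidthChars)
{
    unsigned long flags = 0;
    if (forceASCIIRange)
        flags |= kUnicodeForceASCIIRangeMask;
    if (noHalfwidthChars)
        flags |= kUnicodeNoHalfwidthCharsMask;
    return flags;
}
```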

f) Defined new feature/fix bits (and corresponding masks) for the tecUnicodeConverterFeatures field of the TECInfo structure returned by TECGetInfo, to indicate new bug fixes/enhancements in TEC 1.4. These are (TextCommon.h):

kTECAddForceASCIIChangesBit = 4
kTECPreferredEncodingFixBit = 5
kTECAddForceASCIIChangesMask = 1L << kTECAddForceASCIIChangesBit
kTECPreferredEncodingFixMask = 1L << kTECPreferredEncodingFixBit
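
A client could test for these bits along the following lines. This is only a sketch: in real code the features value would come from the tecUnicodeConverterFeatures field of the TECInfo structure returned by TECGetInfo, whereas here it is passed in as a plain integer so that the fragment compiles without the Universal Interfaces. HasConverterFeature is a hypothetical helper.

```c
/* Bit numbers and masks as listed in this note (TextCommon.h);
   redeclared locally so the sketch compiles on its own. */
enum {
    kTECAddForceASCIIChangesBit = 4,
    kTECPreferredEncodingFixBit = 5
};
enum {
    kTECAddForceASCIIChangesMask = 1L << kTECAddForceASCIIChangesBit,
    kTECPreferredEncodingFixMask = 1L << kTECPreferredEncodingFixBit
};

/* Hypothetical helper: returns nonzero if the given feature mask is
   set in a tecUnicodeConverterFeatures value. */
static int HasConverterFeature(unsigned long features, unsigned long mask)
{
    return (features & mask) != 0;
}
```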

g) Added TextCommon.r (containing the TextEncoding constants as #defines) and UnicodeConverter.r (containing the conversion control flag constants as #defines).

h) Made the stub libraries FAT. Before TEC 1.4 they were PPC only, even though the TEC implementation shared libraries are FAT (supporting PPC and CFM 68K).

2. Implementation bug fixes

a) When HFS Extended, UDF/DVD, or PC Exchange were used, the TEC functions ConvertFromUnicodeToTextRun and ConvertFromUnicodeToScriptCodeRun would dereference a null handle, eventually causing a crash. In addition, in rare circumstances it was possible for TEC to make Memory Manager calls at interrupt time, resulting in HFS Extended catalog corruption and/or incorrect error values returned from application Memory Manager calls. This has been fixed (#2251442).

b) If the boot volume containing the Text Encoding Converter was in HFS Extended format and the name of the TEC extension or the names of any Text Encodings files were localized into non-ASCII characters, the TEC tables in those files were inaccessible. During booting, HFS Extended mangles non-ASCII names; these must be unmangled to access files after TEC has been loaded. (#2216745)

Note: There is a related fix that was made in the File Manager after Mac OS 8.1 was released (this fix is in Mac OS 8.5, and in some localized versions of Mac OS 8.1). Both changes are necessary to fix this problem. In particular, the problem is still present with TEC 1.4 on U.S. versions of Mac OS 8.1.

c) If CreateUnicodeToTextRunInfo was called with iNumberOfMappings=-1 and a preferred mapping that differed from the system script, the preferred mapping was often ignored and the system script used instead. (#2215984)

d) For ConvertFromUnicodeToText[Run]: If an undefined Unicode is not the first character in a text element, then it should just terminate the text element; the functions should not return kTextUndefinedElementErr unless the undefined Unicode begins a text element. (#2212628)

e) If an odd value was passed as the iUnicodeLen parameter to ConvertFromUnicodeToText for UCS-2/UTF-16 text, it attempted to convert the partial Unicode character (and claimed that it read one byte beyond the length passed in iUnicodeLen). It now returns kTECPartialCharErr and only converts through the last complete character. Note that iUnicodeLen can legitimately be odd for UTF-8 text. (#2212626)
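
A caller that wants to avoid passing a partial 16-bit unit in the first place could trim the length before the call. The helper below is a hypothetical sketch, not TEC code. It works only at the level of 16-bit code units (it does not look for split surrogate pairs) and must not be applied to UTF-8 text, where odd lengths are valid.

```c
#include <stddef.h>

/* Hypothetical helper: number of leading bytes that form whole 16-bit
   code units, i.e. the byte length rounded down to an even value. */
static size_t CompleteUTF16ByteCount(size_t byteLen)
{
    return byteLen & ~(size_t)1;
}
```

Any byte left over after this trimming would be carried over and prepended to the next buffer by the caller.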

f) Preserve current resource file across TextCommon cfrg initialization (and across InitializeUnicodeConverter). (part of #2203534)

g) TECConvertText could hang when processing EUC-JP if it encountered a partial character (e.g. just 0x8F) at the end of a buffer. (#2219197)
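
To show what a partial character at the end of a buffer means for EUC-JP, here is a minimal C sketch that measures how much of a buffer consists of complete characters. It is not the TEC implementation. The lead-byte ranges used (0x8F starts a 3-byte character, 0x8E or 0xA1-0xFE starts a 2-byte character, anything else is treated as 1 byte) follow the usual description of EUC-JP and are not taken from this note; trail bytes are not validated. The function name is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: returns the number of leading bytes in buf that
   form complete EUC-JP characters. A trailing partial character (for
   example a lone 0x8F) is excluded from the count. */
static size_t EUCJPCompletePrefixLength(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = buf[i];
        size_t need;

        if (b == 0x8F)
            need = 3;                       /* codeset 3 (JIS X0212) */
        else if (b == 0x8E || (b >= 0xA1 && b <= 0xFE))
            need = 2;                       /* codeset 2 or codeset 1 */
        else
            need = 1;                       /* ASCII range and others */

        if (len - i < need)
            break;                          /* partial character at end */
        i += need;
    }
    return i;
}
```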

h) TECSniffTextEncoding did not record errors for invalid single-byte characters in ISO-2022-JP. (#2218976)

i) This was the single bug fix in TEC 1.3.1: TECConvertText was writing over low memory (e.g. address 8). A structure was being initialized before the pointer used to access it was initialized (the pointer was NULL). (#2203186)

3. Implementation enhancements and changes

a) Implement new UnicodeConverter options kUnicodeForceASCIIRangeBit and kUnicodeNoHalfwidthCharsBit. These are used for the iControlFlags parameter of ConvertFromTextToUnicode, ConvertFromUnicodeToText, etc. See item 1e above. (#1644700)

Implementing these options entailed rearranging some mapping tables and changing some mappings; these mapping changes (and others) are described in section 4.

b) Upgrade to support Unicode 2.1 (which becomes the default Unicode version). Unicode 2.1 adds the EURO SIGN and OBJECT REPLACEMENT CHARACTER, makes some changes to direction classes, and has a few other changes that do not affect TEC operation. (part of #2203409, #2237762)

c) Support the addition of EURO SIGN to Mac OS Roman and Mac OS Symbol. These encodings, and the associated fonts that Apple ships, are changing for Mac OS 8.5 to support the new EURO SIGN character.

Mac OS Roman had no unassigned code points. So code point 0xDB, which was formerly CURRENCY SIGN, has been reassigned as EURO SIGN. TEC handles this as follows (#2203409, #2257036):

Mac OS Symbol had several unassigned code points. One of these, 0xA0, is now assigned EURO SIGN. So the mapping for kTextEncodingMacSymbol is changed as follows: 0xA0 now maps to Unicode 0x20AC (EURO SIGN) when mapping to Unicode 2.0 or 2.1, and to private-use Unicode 0xF8A0 when mapping to Unicode 1.1 (as above). (#2256792)
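
The Mac OS Symbol rule just described can be written as a small C sketch. The function is hypothetical and models only this one code point. The value 0x0103 for Unicode 2.0 and 2.1 is quoted from section 1a; the value 0x0101 for Unicode 1.1 is not stated in this note and is an assumption that should be checked against TextCommon.h.

```c
/* Redeclared locally so the sketch compiles on its own. */
enum {
    kTextEncodingUnicodeV1_1 = 0x0101,  /* assumed; not stated in this note */
    kTextEncodingUnicodeV2_0 = 0x0103,  /* from section 1a */
    kTextEncodingUnicodeV2_1 = 0x0103   /* same value as V2_0 */
};

/* Hypothetical helper: Unicode value for Mac OS Symbol code point 0xA0,
   depending on the Unicode version being mapped to. */
static unsigned short MacSymbolA0ToUnicode(unsigned long unicodeBase)
{
    if (unicodeBase == (unsigned long)kTextEncodingUnicodeV1_1)
        return 0xF8A0;  /* private-use value for Unicode 1.1 */
    return 0x20AC;      /* EURO SIGN for Unicode 2.0 or 2.1 */
}
```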

d) Add support for the following encodings (#2227042):

e) Add Unicode Converter support for EUC-JP. In previous TEC versions, the high-level Text Encoding Converter provided algorithmic conversion between EUC-JP and ISO 2022-JP or Shift-JIS/MacJapanese. However, this resulted in loss of the JIS X0212 characters. Now all of the EUC-JP characters, including the JIS X0212 characters, can be converted to and from Unicode. The new options kUnicodeForceASCIIRangeBit and kUnicodeNoHalfwidthCharsBit (see 3a above) can be used with EUC-JP. (#1643497)

f) Support 4-byte codeset 2 characters in EUC-TW. In previous TEC versions, ConvertFromTextToUnicode supported only the 1-byte codeset 0 and 2-byte codeset 1 characters in EUC-TW. ConvertFromUnicodeToText[Run] could map from Unicode to 4-byte characters in EUC-TW for codeset 2, planes 2 and 3, but only as loose mappings (since the 4-byte EUC-TW characters could not be mapped back to Unicode). (#1634875)

In TEC 1.4, the 4-byte characters for EUC-TW codeset 2, planes 2 and 3, can be mapped to or from Unicode as strict mappings. The 4-byte characters for EUC-TW codeset 2, plane 1, can only be mapped to Unicode as loose mappings, since the resulting Unicode characters are mapped back to EUC-TW as 2-byte codeset 1 characters (CNS plane 1 is redundantly encoded in EUC-TW).
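
The strict/loose distinction above can be sketched in C as a classifier for 4-byte EUC-TW sequences. This is an illustration only, not TEC code. It assumes the usual EUC-TW layout for codeset 2, which is not spelled out in this note: a 0x8E lead byte, then a plane byte in the range 0xA1-0xB0 (0xA1 for plane 1, 0xA2 for plane 2, and so on), then two more bytes that are not validated here. All names are hypothetical.

```c
#include <stddef.h>

typedef enum {
    kSketchNotCodeset2 = 0,   /* not a 4-byte codeset 2 sequence */
    kSketchLooseToUnicode,    /* plane 1: maps to Unicode as a loose mapping */
    kSketchStrictRoundTrip,   /* planes 2 and 3: strict in both directions */
    kSketchOtherPlane         /* planes this note does not describe */
} SketchEUCTWMapping;

/* Hypothetical helper: classifies the sequence starting at p according
   to the TEC 1.4 behavior described in this note. */
static SketchEUCTWMapping ClassifyEUCTWCodeset2(const unsigned char *p, size_t len)
{
    int plane;

    if (len < 4 || p[0] != 0x8E)
        return kSketchNotCodeset2;
    if (p[1] < 0xA1 || p[1] > 0xB0)
        return kSketchNotCodeset2;

    plane = p[1] - 0xA0;
    if (plane == 1)
        return kSketchLooseToUnicode;   /* redundant with 2-byte codeset 1 */
    if (plane == 2 || plane == 3)
        return kSketchStrictRoundTrip;
    return kSketchOtherPlane;
}
```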

g) GetTextEncodingName can now return many different localized names for any supported encoding (#2234345). The following localized versions are available for all encoding names (Additional localized versions, such as verChina/kTextEncodingMacChineseSimp, are available for certain names):

As part of this, the names that previously included "Unicode 2.0" now have "Unicode 2.1".

Note that the names returned by GetTextEncodingName may contain parentheses, and so cannot be used with AppendMenu or InsertMenuItem. They can, however, be used with SetMenuItemText. This is not a change with TEC 1.4, but it was not previously mentioned in the TEC documentation (it will be for TEC 1.4).

h) ResolveDefaultTextEncoding maps kTextEncodingMacUnicode (see 1b above) to kTextEncodingUnicodeV2_1. CreateTextToUnicodeInfo, CreateUnicodeToTextInfo, and other functions call ResolveDefaultTextEncoding and thus handle kTextEncodingMacUnicode. (#2254112)
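
The resolution of Unicode meta-values can be pictured with the following C sketch. It is a simplified model, not the ResolveDefaultTextEncoding implementation: it operates on a bare TextEncodingBase value rather than a full TextEncoding, it handles only the two Unicode meta-values mentioned in this note, and the function name is hypothetical. The constants are redeclared locally with the values given in section 1.

```c
/* Values quoted from this note; redeclared locally so the sketch
   compiles without TextCommon.h. */
enum {
    kTextEncodingMacUnicode     = 0x007E,
    kTextEncodingUnicodeDefault = 0x0100,
    kTextEncodingUnicodeV2_1    = 0x0103
};

/* Hypothetical helper: maps a Unicode meta-value to the actual Unicode
   version used by TEC 1.4; other values pass through unchanged. */
static unsigned long SketchResolveUnicodeMetaValue(unsigned long encodingBase)
{
    switch (encodingBase) {
    case kTextEncodingMacUnicode:
    case kTextEncodingUnicodeDefault:
        return kTextEncodingUnicodeV2_1;
    default:
        return encodingBase;
    }
}
```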

i) Script.h (and Script.r, etc.) was brought up to date with many new language and region codes for Universal Interfaces 3.1, with additional updates for Universal Interfaces 3.2. UpgradeScriptInfoToTextEncoding and RevertTextEncodingToScriptInfo have been updated for TEC 1.4 to handle the new language and region codes. (#2203406)

j) Updated TECGetInfo to set the new feature bits described in section 1f above. (#2252655)

k) There were several changes to the Internet name mappings for TECGetTextEncodingFromInternetName and TECGetTextEncodingInternetName:

l) Moved as many resources as possible from the Text Encoding Converter extension to the files in the Text Encodings folder, leaving only the resources needed to support mapping between Mac OS encodings and the kUnicodeCanonicalDecompVariant variant of kTextEncodingUnicodeV2_1. Resources that were moved include:

Functionality that depends on those resources will not work if the TEC 1.4 extension is used with the Text Encodings files from TEC 1.3 (because the necessary resources will be unavailable).

m) The Mac OS 8.5 System file includes PPC versions of the TextCommon and UnicodeConverter shared libraries, as well as the resources necessary to convert between Mac OS encodings and the kUnicodeCanonicalDecompVariant variant of kTextEncodingUnicodeV2_1. These libraries and resources are duplicates of those in the Text Encoding Converter extension file from TEC 1.4.

n) Implementation shared libraries have current version of 2 for TEC 1.4. (#2252349)

4. Other mapping changes

a) kTextEncodingMacJapanese, kMacJapaneseStandardVariant:

b) kTextEncodingMacJapanese, kMacJapanesePostScriptScrnVariant (used for fonts SaiMincho and ChuGothic): Add missing mappings for characters in the range 0x86A2-0x879C. Before TEC 1.4, these mappings were only for kMacJapanesePostScriptPrintVariant. In TEC 1.4 they are added for both kMacJapanesePostScriptScrnVariant and kTextEncodingDOSJapanese. Some of the characters in this range are duplicates of standard Shift-JIS characters so the mappings are changed in TEC 1.4 to ensure roundtrip fidelity (which is not required for kMacJapanesePostScriptPrintVariant), as follows (#2245742):

c) kTextEncodingMacChineseSimp

d) kTextEncodingShiftJIS

e) kTextEncodingBig5

f) kTextEncodingGBK_95

g) Updated the mapping tables for Windows 8-bit codepages to reflect the addition of EURO SIGN, and other recent additions per Windows mapping tables posted at <ftp.unicode.org> (#2237762). The new code points are as follows:



- # -